In [1]:
library(ISLR)
In [59]:
attach(Weekly)
In [2]:
summary(Weekly)
      Year           Lag1               Lag2               Lag3         
 Min.   :1990   Min.   :-18.1950   Min.   :-18.1950   Min.   :-18.1950  
 1st Qu.:1995   1st Qu.: -1.1540   1st Qu.: -1.1540   1st Qu.: -1.1580  
 Median :2000   Median :  0.2410   Median :  0.2410   Median :  0.2410  
 Mean   :2000   Mean   :  0.1506   Mean   :  0.1511   Mean   :  0.1472  
 3rd Qu.:2005   3rd Qu.:  1.4050   3rd Qu.:  1.4090   3rd Qu.:  1.4090  
 Max.   :2010   Max.   : 12.0260   Max.   : 12.0260   Max.   : 12.0260  
      Lag4               Lag5              Volume            Today         
 Min.   :-18.1950   Min.   :-18.1950   Min.   :0.08747   Min.   :-18.1950  
 1st Qu.: -1.1580   1st Qu.: -1.1660   1st Qu.:0.33202   1st Qu.: -1.1540  
 Median :  0.2380   Median :  0.2340   Median :1.00268   Median :  0.2410  
 Mean   :  0.1458   Mean   :  0.1399   Mean   :1.57462   Mean   :  0.1499  
 3rd Qu.:  1.4090   3rd Qu.:  1.4050   3rd Qu.:2.05373   3rd Qu.:  1.4050  
 Max.   : 12.0260   Max.   : 12.0260   Max.   :9.32821   Max.   : 12.0260  
 Direction 
 Down:484  
 Up  :605  
In [5]:
par(mfrow=c(2,3))
hist(Weekly$Lag1)
hist(Weekly$Lag2)
hist(Weekly$Lag3)
hist(Weekly$Lag4)
hist(Weekly$Lag5)
hist(Weekly$Volume)
In [6]:
log1 <- glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume, data=Weekly, family=binomial)
In [7]:
summary(log1)
Call:
glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
    Volume, family = binomial, data = Weekly)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.6949  -1.2565   0.9913   1.0849   1.4579  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)   
(Intercept)  0.26686    0.08593   3.106   0.0019 **
Lag1        -0.04127    0.02641  -1.563   0.1181   
Lag2         0.05844    0.02686   2.175   0.0296 * 
Lag3        -0.01606    0.02666  -0.602   0.5469   
Lag4        -0.02779    0.02646  -1.050   0.2937   
Lag5        -0.01447    0.02638  -0.549   0.5833   
Volume      -0.02274    0.03690  -0.616   0.5377   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1496.2  on 1088  degrees of freedom
Residual deviance: 1486.4  on 1082  degrees of freedom
AIC: 1500.4

Number of Fisher Scoring iterations: 4

Only Lag2 appears to be statistically significant, with p = 0.0296 < 0.05; none of the other predictors reach significance at that level.
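As a quick sanity check on that reading of the p-values, Wald confidence intervals for the coefficients can be inspected directly; an interval that excludes zero corresponds to a significant predictor. This is a sketch that assumes the fitted object `log1` from the cell above is still in the workspace:

```r
# Wald 95% confidence intervals for the fitted coefficients;
# Lag2 should be the only lagged return whose interval excludes zero
confint.default(log1)

# The Lag2 coefficient on the odds scale: the multiplicative change
# in the odds of an "Up" week per unit increase in Lag2
exp(coef(log1)["Lag2"])
```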

In [8]:
log1.pred <- predict(log1, type="response")
In [10]:
log1.response <- rep("Down", nrow(Weekly))
In [11]:
log1.response[log1.pred > 0.5] = "Up"
In [13]:
table(log1.response, Weekly$Direction)
             
log1.response Down  Up
         Down   54  48
         Up    430 557
In [16]:
mean(log1.response == Weekly$Direction)
0.561065197428834
In [17]:
(54+557)/1089
0.561065197428834

The training accuracy is 56.1%, so the training error rate is 100 - 56.1 = 43.9%. The training error rate is often overly optimistic: it tends to underestimate the test error rate. To better assess the accuracy of the logistic regression model in this setting, we can fit the model using part of the data and then examine how well it predicts the held-out data. This yields a more realistic error rate, in the sense that in practice we are interested in the model's performance not on the data used to fit it, but on weeks in the future for which the market's movements are unknown.

In [24]:
training <- Weekly$Year < 2009
In [25]:
Weekly.2009 <- Weekly[!training,]
In [26]:
dim(Weekly.2009)
104  9
In [27]:
direction.2009 <- Weekly$Direction[!training]
In [28]:
glm.fit <- glm(Direction~Lag2, data=Weekly, family=binomial, subset=training)
In [29]:
glm.probs = predict(glm.fit, Weekly.2009, type="response")
In [31]:
glm.pred <- rep("Down", length(direction.2009))
In [32]:
glm.pred[glm.probs > 0.5] = "Up"
In [33]:
table(glm.pred, direction.2009)
        direction.2009
glm.pred Down Up
    Down    9  5
    Up     34 56
In [34]:
mean(glm.pred == direction.2009)
0.625
In [35]:
library(MASS)
lda.fit = lda(Direction ~ Lag2, data = Weekly, subset = training)
lda.pred = predict(lda.fit, Weekly.2009)
table(lda.pred$class, direction.2009)
      direction.2009
       Down Up
  Down    9  5
  Up     34 56
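For a like-for-like comparison with the logistic regression fit, the LDA test accuracy can be computed the same way (a small addition, reusing `lda.pred` and `direction.2009` from above):

```r
# Fraction of 2009-2010 weeks classified correctly by LDA; the confusion
# matrix is identical to the logistic one, so this is (9+56)/104 = 0.625
mean(lda.pred$class == direction.2009)
```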
In [38]:
qda.fit = qda(Direction ~ Lag2, data = Weekly, subset = training)
qda.class = predict(qda.fit, Weekly.2009)$class
table(qda.class, direction.2009)
         direction.2009
qda.class Down Up
     Down    0  0
     Up     43 61
In [69]:
mean(qda.class==direction.2009)
0.586538461538462
In [63]:
train.X = as.matrix(Lag2[training])
In [64]:
test.X = as.matrix(Lag2[!training])
In [65]:
train.Y = Direction[training]
In [66]:
library(class)
set.seed(1)
knn.pred = knn(train.X, test.X, train.Y, k=1)
In [67]:
table(knn.pred, direction.2009)
        direction.2009
knn.pred Down Up
    Down   21 30
    Up     22 31
In [68]:
mean(knn.pred == direction.2009)
0.5

Logistic regression and LDA tie for the best test accuracy (62.5%) on this data; QDA (58.7%) and KNN with k = 1 (50%) do worse.
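One way to make the comparison explicit is to collect the four held-out accuracies in a single cell (a sketch, reusing the prediction objects created above):

```r
# Test-set accuracy for each classifier on the 2009-2010 weeks
sapply(list(logistic = glm.pred,
            lda      = as.character(lda.pred$class),
            qda      = as.character(qda.class),
            knn      = as.character(knn.pred)),
       function(p) mean(p == direction.2009))
# logistic and LDA tie at 0.625; QDA is about 0.587; KNN (k=1) is 0.5
```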

In [71]:
Auto$mpg01 <- rep(0, nrow(Auto))
In [73]:
Auto$mpg01[Auto$mpg > median(Auto$mpg)] = 1
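Since mpg values exactly equal to the median fall into the 0 class, it is worth confirming that the split is roughly balanced before fitting classifiers (a quick check on the `Auto` data frame as modified above):

```r
# Class counts for the median split of mpg; with 392 observations
# the two classes should be (close to) equal in size
table(Auto$mpg01)
```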
In [87]:
nrow(Auto)
392
In [78]:
library(corrplot)
In [84]:
cor(Auto[,-9])
                    mpg  cylinders displacement horsepower     weight acceleration       year     origin      mpg01
mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442    0.4233285  0.5805410  0.5652088  0.8369392
cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273   -0.5046834 -0.3456474 -0.5689316 -0.7591939
displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944   -0.5438005 -0.3698552 -0.6145351 -0.7534766
horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377   -0.6891955 -0.4163615 -0.4551715 -0.6670526
weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000   -0.4168392 -0.3091199 -0.5850054 -0.7577566
acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392    1.0000000  0.2903161  0.2127458  0.3468215
year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199    0.2903161  1.0000000  0.1815277  0.4299042
origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054    0.2127458  0.1815277  1.0000000  0.5136984
mpg01         0.8369392 -0.7591939   -0.7534766 -0.6670526 -0.7577566    0.3468215  0.4299042  0.5136984  1.0000000
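The corrplot package loaded above can render this matrix graphically, which makes the strong negative correlations of mpg01 with cylinders, displacement, horsepower, and weight easier to spot than in the numeric table:

```r
# Visualize the correlation matrix; column 9 (the character "name") is dropped
corrplot(cor(Auto[,-9]), method = "circle")
```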
In [86]:
pairs(Auto)